Beyond Words: Understanding Tokenization and the Lollipop Test

The Hidden Architecture of Language

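Large language models (LLMs) do not "read" text the way humans do. Where we see letters and words, a model processes text as numeric chunks called tokens. Understanding this abstraction is the first step toward mastering prompt engineering and system design.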

The Lollipop Test

Why does a large language model often struggle to reverse the word "lollipop", yet succeed immediately with "l-o-l-l-i-p-o-p"?

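The problem: In its standard spelling, the word is split into one or a few multi-character tokens rather than into individual letters, so the model cannot clearly see the letters inside each token.
The solution: Putting hyphens between the letters forces the tokenizer to treat each letter as its own token, giving the model the fine-grained "view" it needs to complete the task.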
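
To make the mechanism concrete, here is a minimal sketch of a toy tokenizer. Everything in it is an assumption for illustration: `TOY_VOCAB` and `toy_tokenize` are hypothetical names, the vocabulary is hand-picked, and greedy longest-match is a simplification (production models typically use learned subword schemes such as byte-pair encoding). How a real model splits "lollipop" depends on its own tokenizer.

```python
# Hypothetical, hand-picked vocabulary. This is NOT a real model's vocabulary.
TOY_VOCAB = {"l", "o", "i", "p", "-", "oll", "ipop"}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text by always taking the longest vocabulary piece that matches."""
    tokens = []
    pos = 0
    max_len = max(len(piece) for piece in vocab)
    while pos < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if piece in vocab:
                tokens.append(piece)
                pos += size
                break
        else:
            # Character not in the toy vocabulary: emit it as its own token.
            tokens.append(text[pos])
            pos += 1
    return tokens


print(toy_tokenize("lollipop"))         # ['l', 'oll', 'ipop']
print(toy_tokenize("l-o-l-l-i-p-o-p"))  # every letter and hyphen is its own token
```

In the first call the letters are buried inside multi-character pieces; in the second, the hyphens break up every multi-character match, so each letter surfaces as a separate token.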

Core Principles

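Token ratio: As a rule of thumb, 1 token corresponds to about 4 characters of English text, or roughly 0.75 words.
Context window: A model has a fixed "memory" capacity (for example, 4,096 tokens). This limit covers both your instructions and the model's reply.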
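
The two principles above can be combined into a quick budget check. The sketch below is only an approximation under stated assumptions: `estimate_tokens` and `fits_in_context` are hypothetical helper names, the 4-characters-per-token ratio is the rule of thumb rather than a real tokenizer count, and 4,096 is just the example window size mentioned above.

```python
CHARS_PER_TOKEN = 4  # rule of thumb for English text, not an exact figure


def estimate_tokens(text: str) -> int:
    """Rough estimate: about one token per 4 characters, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def fits_in_context(prompt_tokens: int, reply_tokens: int, window: int = 4096) -> bool:
    """The window has to hold the prompt and the reply together."""
    return prompt_tokens + reply_tokens <= window


prompt = "Summarize the following report. " * 100  # 3,200 characters
prompt_tokens = estimate_tokens(prompt)

print(prompt_tokens)                                        # 800
print(fits_in_context(prompt_tokens, reply_tokens=500))     # True
print(fits_in_context(prompt_tokens, reply_tokens=3500))    # False
```

The last call fails because the reply allowance counts against the same window as the prompt: 800 + 3,500 exceeds 4,096.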

Base LLMs vs. Instruction-Tuned LLMs

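Base LLM: Predicts the most likely next word based on a massive training dataset. For example, given "What is the capital of France?", it may continue with "What is the capital of Germany?", extending the pattern instead of answering the question.
Instruction-tuned LLM: Fine-tuned, typically with reinforcement learning from human feedback (RLHF), to follow specific instructions and act as an assistant.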

Question 1

If you are processing a document that is 3,000 English characters long, roughly how many tokens will the model consume?

A) 3,000 tokens

B) 750 tokens

C) 12,000 tokens

Question 2

Why is an Instruction-Tuned LLM preferred over a Base LLM for building a chatbot?

A) It is faster at generating text.

B) It uses fewer tokens.

C) It is trained to follow specific tasks and dialogue formats.

Challenge: Token Estimation

Apply the token ratio rule to a real-world scenario.

You are designing an automated summarization system. The system receives daily reports that average 10,000 characters in length.

Your API provider charges $0.002 per 1,000 tokens.

Step 1

Estimate the number of tokens for a single daily report.

Solution:
Using the rule of thumb (1 token ≈ 4 characters):
$$ \text{Tokens} = \frac{10{,}000}{4} = 2{,}500 \text{ tokens} $$

Step 2

Calculate the estimated cost to process one daily report.

Solution:
The cost is $0.002 per 1,000 tokens.
$$ \text{Cost} = \left( \frac{2{,}500}{1{,}000} \right) \times 0.002 = 2.5 \times 0.002 = \$0.005 $$
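
The same two steps can be sketched as a small Python helper. The function name `estimate_report_cost` and its parameters are hypothetical, and the result is only as good as the 4-characters-per-token rule of thumb; an actual bill depends on the provider's real token counts.

```python
def estimate_report_cost(num_chars, price_per_1k_tokens, chars_per_token=4):
    """Return (estimated tokens, estimated cost) for one document."""
    tokens = num_chars / chars_per_token          # Step 1: characters -> tokens
    cost = tokens / 1000 * price_per_1k_tokens    # Step 2: tokens -> cost
    return tokens, cost


tokens, cost = estimate_report_cost(10_000, 0.002)
print(f"{tokens:.0f} tokens, {cost:.4f} dollars per report")
# 2500 tokens, 0.0050 dollars per report
```

Keeping `chars_per_token` as a parameter makes it easy to rerun the estimate if the ratio for your text turns out to differ from 4.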